In this paper, we investigate cross-media retrieval between images and text, i.e., using an image to
search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods
usually learn one couple of projections, by which the original features of images and text can be
projected into a common latent space to measure content similarity. However, using the same
projections for the two different retrieval tasks (I2T and T2I) may lead to a tradeoff between their
respective performances, rather than the best performance for each. Different from previous works, we
propose a modality-dependent cross-media retrieval (MDCR) model, where two couples of projections are
learned for different cross-media retrieval tasks instead of one couple of projections. Specifically, by
jointly optimizing the correlation between images and text and the linear regression from one modal
space (image or text) to the semantic space, two couples of mappings are learned to project images and
text from their original feature spaces into two common latent subspaces (one for I2T and the other for
T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In
particular, based on the 4,096 dimensional convolutional neural network (CNN) visual feature and the
100 dimensional LDA textual feature, the mAP of the proposed method reaches 41.5%, which is a new
state-of-the-art performance on the Wikipedia dataset.

Fig. 1. Cross-media retrieval tasks considered in this paper: (a) using image to search text (I2T) and
(b) using text to search images (T2I). (a) Given an image of Iniesta, the task is to find some text
reports related to this image. (b) Given a text document about Kobe Bryant and Michael Jordan, the task
is to find some related images about them. Source images: ©ferhat.culfaz: https://goo.gl/of54g4,
©Basket Streaming: https://goo.gl/DfZLRs, ©Wikipedia: http://goo.gl/D6RYkt.

1. INTRODUCTION 

With the rapid development of information technology, multi-modal data (e.g., image, text, video or
audio) have become widely available on the Internet. For example, an image often co-occurs with text
on a web page to describe the same object or event. Related research has been conducted
incrementally in recent decades, among which retrieval across different modalities has attracted
much attention and benefited many practical applications. However, multi-modal data usually span
different feature spaces. This heterogeneous characteristic poses a great challenge to cross-media
retrieval tasks. In this work, we mainly focus on cross-media retrieval between text and images
(Fig. 1), i.e., using an image (text) to search for text documents (images) with similar semantics.

To address this issue, many approaches have been proposed that learn a common representation for
the data of different modalities. We observe that most existing works [Hardoon et al. 2004;
Rasiwasia et al. 2010; Sharma et al. 2012; Gong et al. 2013] focus on learning one couple of mapping
matrices to project high-dimensional features from different modalities into a common latent space.
By doing this, the correlations of two variables from different modalities can be maximized in the
learned common latent subspace. However, only considering pair-wise closeness [Hardoon et al. 2004]
is not sufficient for cross-media retrieval tasks, since multi-modal data with the same semantics
are also required to be united in the common latent subspace. Although [Sharma et al. 2012] and
[Gong et al. 2013] have proposed to use supervised information to cluster the multi-modal data with
the same semantics, learning one couple of projections may only lead to compromised results for each
retrieval task.

In this paper, we propose a modality-dependent cross-media retrieval (MDCR) method, which recommends
different treatments for different retrieval tasks, i.e., I2T and T2I. Specifically, MDCR is a
task-specific method, which learns two couples of projections for different retrieval tasks. The
proposed method is illustrated in Fig. 2. Fig. 2(a) and Fig. 2(c) are two linear regression
operations from the image and the text feature space to the semantic space, respectively. By doing
this, multi-modal data with the same semantics can be united in the common latent subspace.
Fig. 2(b) is a correlation analysis operation to keep pair-wise closeness of multi-modal data in the
common space. We combine Fig. 2(a) and Fig. 2(b) to learn a couple of projections for I2T, and a
different couple of projections for T2I is jointly optimized by Fig. 2(b) and Fig. 2(c). The reason
why we learn two couples of projections rather than one couple for different retrieval tasks can be
explained as follows. For I2T, we argue that the accurate representation of the query (i.e., the
image) in the semantic space is more important than that of the text to be retrieved. If the
semantics of the query are misjudged, it will be even harder to retrieve the relevant text.
Therefore, only the linear regression term from image feature to semantic label vector and the
correlation analysis term are considered for optimizing the mapping matrices for I2T. For T2I, the
reasoning is analogous: the query is the text, so only the linear regression term from text feature
to semantic label vector and the correlation analysis term are considered. The main contributions of
this work are listed as follows:

• We propose a modality-dependent cross-media retrieval method, which projects data of different
modalities into a common space so that a similarity measurement such as Euclidean distance can be
applied for cross-media retrieval.

• To better validate the effectiveness of our proposed MDCR, we compare it with other
state-of-the-art methods based on more powerful feature representations. In particular, with the
4,096 dimensional CNN visual feature and the 100 dimensional LDA textual feature, the mAP of the
proposed method reaches 41.5%, which is, as far as we know, a new state-of-the-art performance on
the Wikipedia dataset.

• Based on the INRIA-Websearch dataset [Krapac et al. 2010], we construct a new dataset for
cross-media retrieval evaluation. In addition, all the features utilized in this paper are publicly
available¹.

The remainder of this paper is organized as follows. We briefly review the related work on
cross-media retrieval in Section 2. In Section 3, the proposed modality-dependent cross-media
retrieval method is described in detail. Then in Section 4, experimental results are reported and
analyzed. Finally, Section 5 presents the conclusions.

2. RELATED WORK 

During the past few years, numerous methods have been proposed to address cross-media retrieval.
Some works [Hardoon et al. 2004; Tenenbaum and Freeman 2000; Rosipal and Kramer 2006; Yang et al.
2008; Sharma and Jacobs 2011; Hwang and Grauman 2010; Rasiwasia et al. 2010; Sharma et al. 2012;
Gong et al. 2013; Wei et al. 2014; Zhang et al. 2014] try to learn an optimal common latent subspace
for multi-modal data. Methods of this kind project representations of multiple modalities into an
isomorphic space, such that a similarity measurement can be directly applied between multi-modal
data. Two popular approaches, Canonical Correlation Analysis (CCA) [Hardoon et al. 2004] and Partial
Least Squares (PLS) [Rosipal and Kramer 2006; Sharma and Jacobs 2011], are usually employed to find
a couple of mappings that maximize the correlations between two variables. Based on CCA, a number of
successful algorithms have been developed for cross-media retrieval tasks [Rashtchian et al. 2010;
Hwang and Grauman 2010; Sharma et al. 2012; Gong et al. 2013]. The work [Rashtchian et al. 2010]
investigated the cross-media retrieval problem in terms of a correlation hypothesis and an
abstraction hypothesis. Based on the isomorphic feature space obtained from CCA, a multi-class
logistic regression is applied to generate a common semantic space for cross-media retrieval tasks.
[Hwang and Grauman 2010] used kernel CCA (KCCA) to develop a cross-media retrieval method by
modeling the correlation between visual features and textual features. The work [Sharma et al. 2012]
presented a generic framework for multi-modal feature extraction techniques, called Generalized
Multiview Analysis (GMA). More recently, the work [Gong et al. 2013] proposed a three-view CCA model
by introducing a semantic view to produce a better separation for multi-modal data of different
classes in the learned latent subspace.

Fig. 2. Modality-dependent cross-media retrieval (MDCR) model proposed in this paper. Images are
represented by square icons, while text is represented by round icons; different colors indicate
different classes. Ellipse fields with blue color and red color indicate semantic clusters of
FootballGame and BasketballGame, respectively. (a) Linear regression from image feature space to
semantic space to produce a better separation for images of different classes. (b) Correlation
analysis between images and text to keep pair-wise closeness. (c) Linear regression from text
feature space to semantic space to produce a better separation for text of different classes.
Source images: ©Basket Streaming: https://goo.gl/DfZLRs, ©Wikipedia: http://goo.gl/RqWL60,
©Wikipedia: http://goo.gl/k3cPs8, ©Wikipedia: https://goo.gl/RdgsNL.

¹https://sites.google.com/site/yunchaosite/mdcr

To address the problem of prohibitively expensive nearest neighbor search, some hashing-based
approaches [Kumar and Udupa 2011; Wu et al. 2014] to large scale similarity search have drawn much
interest from the cross-media retrieval community. In particular, [Kumar and Udupa 2011] proposed a
cross view hashing method to generate hash codes by minimizing the distance between hash codes for
similar data and maximizing the distance for dissimilar data. Recently, [Wu et al. 2014] proposed a
sparse multi-modal hashing method, which obtains sparse codes for the data across different
modalities via joint multi-modal dictionary learning, to address cross-modal retrieval. Besides,
with the development of deep learning, some deep models [Frome et al. 2013; Wang et al. 2014; Lu et
al. 2014; Zhuang et al. 2014] have also been proposed to address cross-media problems. Specifically,
[Frome et al. 2013] presented a deep visual-semantic embedding model to identify visual objects
using both labeled image data and semantic information obtained from unannotated text documents.
[Wang et al. 2014] proposed an effective mapping mechanism, based on the stacked auto-encoders deep
model, which can capture both intra-modal and inter-modal semantic relationships of multi-modal data
from heterogeneous sources.

Beyond the above-mentioned models, some other works [Yang et al. 2009; Yang et al. 2010; Yang et al.
2012; Wu et al. 2013; Zhai et al. 2013; Kang et al. 2014] have also been proposed to address
cross-media problems. In particular, [Wu et al. 2013] presented a bi-directional cross-media
semantic representation model by optimizing the bi-directional list-wise ranking loss with a latent
space embedding. In [Zhai et al. 2013], both the intra-media and the inter-media correlation are
explored for cross-media retrieval. Most recently, [Kang et al. 2014] presented a heterogeneous
similarity learning approach based on metric learning for cross-media retrieval. With the
convolutional neural network (CNN) visual feature, some new state-of-the-art cross-media retrieval
results have been achieved in [Kang et al. 2014].

3. MODALITY-DEPENDENT CROSS-MEDIA RETRIEVAL 

In this section, we detail the proposed supervised cross-media retrieval method, which we call
modality-dependent cross-media retrieval (MDCR). Each pair of image and text in the training set is
accompanied by semantic information (e.g., class labels). Different from [Gong et al. 2013], which
incorporates the semantic information as a third view, in this paper semantic information is
employed to determine a common latent space with a fixed dimension where samples with the same label
can be clustered.

Suppose we are given a dataset of $n$ data instances, i.e., $Q = \{(x_i, t_i)\}_{i=1}^{n}$, where
$x_i \in \mathbb{R}^{p}$ and $t_i \in \mathbb{R}^{q}$ are original low-level features of image and
text document, respectively. Let $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times p}$ be the
feature matrix of image data, and $T = [t_1, \ldots, t_n]^T \in \mathbb{R}^{n \times q}$ be the
feature matrix of text data. Assume that there are $c$ classes in $Q$.
$S = [s_1, \ldots, s_n]^T \in \mathbb{R}^{n \times c}$ is the semantic matrix with the $i$th row
being the semantic vector corresponding to $x_i$ and $t_i$. In particular, we set the $j$th element
of $s_i$ as 1 if $x_i$ and $t_i$ belong to the $j$th class.
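
The data layout described above can be illustrated with a minimal NumPy sketch. The helper name `build_semantic_matrix`, the toy dimensions and the use of zero-based integer class labels are assumptions made for illustration only; the remaining entries of each semantic vector are assumed to be 0.

```python
import numpy as np

def build_semantic_matrix(labels, num_classes):
    """Return the n x c semantic matrix S with one row per image-text pair.

    Assumption: `labels` holds zero-based integer class indices, so the
    paper's "jth class" corresponds to column j - 1 here.
    """
    labels = np.asarray(labels)
    S = np.zeros((labels.shape[0], num_classes))
    S[np.arange(labels.shape[0]), labels] = 1.0
    return S

# Toy example: n = 4 pairs, p = 5 image dims, q = 3 text dims, c = 2 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))   # image feature matrix, n x p
T = rng.standard_normal((4, 3))   # text feature matrix, n x q
S = build_semantic_matrix([0, 1, 1, 0], num_classes=2)
```

Each row of `S` is shared by the image and the text of the same pair, which is what lets both modalities be regressed toward the same semantic target.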

Definition 1: The cross-media retrieval problem is to learn two optimal mapping matrices
$V \in \mathbb{R}^{c \times p}$ and $W \in \mathbb{R}^{c \times q}$ from the multi-modal dataset $Q$,
which can be formally formulated into the following optimization framework:

$$ \min_{V, W} f(V, W) = \mathcal{C}(V, W) + \mathcal{L}(V, W) + \mathcal{R}(V, W), \tag{1} $$

where $f$ is the objective function consisting of three terms. In particular, $\mathcal{C}(V, W)$ is
a correlation analysis term used to keep pair-wise closeness of multi-modal data in the common
latent subspace. $\mathcal{L}(V, W)$ is a linear regression term from one modal feature space (image
or text) to the semantic space, used to centralize the multi-modal data with the same semantics in
the common latent subspace. $\mathcal{R}(V, W)$ is the regularization term to control the complexity
of the mapping matrices $V$ and $W$.

In the following subsections, we detail the two algorithms for I2T and T2I based on the
optimization framework Eq.(1).

3.1. Algorithm for I2T 

This section addresses the cross-media retrieval problem of using an image to retrieve its related
text documents. Denote the two optimal mapping matrices for images and text as
$V_1 \in \mathbb{R}^{c \times p}$ and $W_1 \in \mathbb{R}^{c \times q}$, respectively. Based on the
optimization framework Eq.(1), the objective function of I2T is defined as follows:

$$ \min_{V_1, W_1} f(V_1, W_1) = \lambda \|X V_1^T - T W_1^T\|_F^2 + (1 - \lambda) \|X V_1^T - S\|_F^2 + R(V_1, W_1), \tag{2} $$

where $0 < \lambda < 1$ is a tradeoff parameter to balance the importance of the correlation
analysis term and the linear regression term, $\|\cdot\|_F$ denotes the Frobenius norm of the
matrix, and $R(V_1, W_1)$ is the regularization function used to regularize the mapping matrices. In
this paper, the regularization function is defined as:

$$ R(V_1, W_1) = \eta_1 \|V_1\|_F^2 + \eta_2 \|W_1\|_F^2, $$

where $\eta_1$ and $\eta_2$ are nonnegative parameters to balance these two regularization terms.

3.2. Algorithm for T2I 

This section addresses the cross-media retrieval problem of using text to retrieve its related
images. Different from the objective function of I2T, the linear regression term for T2I is a
regression operation from the textual space to the semantic space. Denote the two optimal mapping
matrices for images and text in T2I as $V_2 \in \mathbb{R}^{c \times p}$ and
$W_2 \in \mathbb{R}^{c \times q}$, respectively. Based on the optimization framework Eq.(1), the
objective function of T2I is defined as follows:

$$ \min_{V_2, W_2} f(V_2, W_2) = \lambda \|X V_2^T - T W_2^T\|_F^2 + (1 - \lambda) \|T W_2^T - S\|_F^2 + R(V_2, W_2), \tag{3} $$

where the setting of the tradeoff parameter $\lambda$ and the regularization function
$R(V_2, W_2)$ are consistent with the setting presented in Section 3.1.
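
To make the difference between the two objectives concrete, the following NumPy sketch evaluates both of them with one function. The function name `mdcr_objective`, the `task` switch and the toy data are illustrative assumptions; only the formulas themselves come from Sections 3.1 and 3.2, with $V$ of shape $c \times p$ and $W$ of shape $c \times q$.

```python
import numpy as np

def mdcr_objective(X, T, S, V, W, lam, eta1, eta2, task):
    """Objective of Section 3.1 (task="I2T") or Section 3.2 (task="T2I")."""
    XV = X @ V.T                      # images in the latent space, n x c
    TW = T @ W.T                      # text in the latent space, n x c
    correlation = np.sum((XV - TW) ** 2)          # squared Frobenius norm
    if task == "I2T":
        regression = np.sum((XV - S) ** 2)        # image -> semantic space
    elif task == "T2I":
        regression = np.sum((TW - S) ** 2)        # text -> semantic space
    else:
        raise ValueError("task must be 'I2T' or 'T2I'")
    regularizer = eta1 * np.sum(V ** 2) + eta2 * np.sum(W ** 2)
    return lam * correlation + (1.0 - lam) * regression + regularizer

# Toy check with zero projections: only the regression term is non-zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))
T = rng.standard_normal((4, 3))
S = np.eye(2)[[0, 1, 1, 0]]           # one-hot semantic matrix, n x c
V0, W0 = np.zeros((2, 5)), np.zeros((2, 3))
f_i2t = mdcr_objective(X, T, S, V0, W0, lam=0.1, eta1=0.5, eta2=0.5, task="I2T")
f_t2i = mdcr_objective(X, T, S, V0, W0, lam=0.5, eta1=0.5, eta2=0.5, task="T2I")
```

The only line that changes between the two tasks is the regression term, which is regressed from the modality of the query.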

3.3. Optimization 

The optimization problems for I2T and T2I are unconstrained with respect to two matrices. We treat
both Eq.(2) and Eq.(3) as non-convex optimization problems, which may have many locally optimal
solutions. For non-convex problems, algorithms are usually designed to seek stationary points. We
note that Eq.(2) is convex with respect to either $V_1$ or $W_1$ while fixing the other. Similarly,
Eq.(3) is also convex with respect to either $V_2$ or $W_2$ while fixing the other. Specifically, by
fixing $V_1$ ($V_2$) or $W_1$ ($W_2$), the minimization over the other can be carried out with the
gradient descent method.


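The partial derivatives of Eq.(2) with respect to $V_1$ and $W_1$ are given as follows:

$$ \nabla_{V_1} f(V_1, W_1) = 2 V_1 X^T X + 2 \left[ \eta_1 V_1 - \lambda W_1 T^T X - (1 - \lambda) S^T X \right], \tag{4} $$

$$ \nabla_{W_1} f(V_1, W_1) = 2 \left[ \eta_2 W_1 + \lambda \left( W_1 T^T T - V_1 X^T T \right) \right]. \tag{5} $$

Similarly, the partial derivatives of Eq.(3) with respect to $V_2$ and $W_2$ are given as follows:

$$ \nabla_{V_2} f(V_2, W_2) = 2 \left[ \eta_1 V_2 + \lambda \left( V_2 X^T X - W_2 T^T X \right) \right], \tag{6} $$

$$ \nabla_{W_2} f(V_2, W_2) = 2 W_2 T^T T + 2 \left[ \eta_2 W_2 - \lambda V_2 X^T T - (1 - \lambda) S^T T \right]. \tag{7} $$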
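
As a sanity check on the I2T derivatives, the sketch below differentiates the I2T objective of Section 3.1 analytically and compares the result with central finite differences. The function names and toy data are assumptions for illustration; the gradients are derived directly from the objective, using $\partial \|X V^T - A\|_F^2 / \partial V = 2 (V X^T - A^T) X$.

```python
import numpy as np

def i2t_objective(X, T, S, V, W, lam, eta1, eta2):
    XV, TW = X @ V.T, T @ W.T
    return (lam * np.sum((XV - TW) ** 2)
            + (1.0 - lam) * np.sum((XV - S) ** 2)
            + eta1 * np.sum(V ** 2) + eta2 * np.sum(W ** 2))

def i2t_gradients(X, T, S, V, W, lam, eta1, eta2):
    """Analytic gradients of the I2T objective with respect to V and W."""
    grad_V = 2.0 * (V @ X.T @ X + eta1 * V
                    - lam * W @ T.T @ X - (1.0 - lam) * S.T @ X)
    grad_W = 2.0 * (eta2 * W + lam * (W @ T.T @ T - V @ X.T @ T))
    return grad_V, grad_W

def numerical_gradient(f, M, h=1e-6):
    """Central finite differences of the scalar function f at matrix M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        E = np.zeros_like(M)
        E[idx] = h
        G[idx] = (f(M + E) - f(M - E)) / (2.0 * h)
    return G

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))
T = rng.standard_normal((6, 3))
S = np.eye(2)[[0, 1, 0, 1, 1, 0]]
V = rng.standard_normal((2, 4))
W = rng.standard_normal((2, 3))
lam, eta1, eta2 = 0.1, 0.5, 0.5

grad_V, grad_W = i2t_gradients(X, T, S, V, W, lam, eta1, eta2)
num_V = numerical_gradient(lambda M: i2t_objective(X, T, S, M, W, lam, eta1, eta2), V)
num_W = numerical_gradient(lambda M: i2t_objective(X, T, S, V, M, lam, eta1, eta2), W)
```

The T2I gradients can be checked the same way by swapping the regression term to the text side.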


A common way to solve this kind of optimization problem is to alternate the updates until the
result converges. Algorithm 1 summarizes the optimization procedure of the proposed MDCR method for
I2T, which can be easily extended to T2I.
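
The alternating procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the zero initialization, the iteration caps `max_outer` and `max_inner`, the outer stopping rule and the toy step size `mu=1e-3` (chosen small enough for the random toy features) are all hypothetical choices.

```python
import numpy as np

def objective(X, T, S, V, W, lam, eta1, eta2):
    XV, TW = X @ V.T, T @ W.T
    return (lam * np.sum((XV - TW) ** 2)
            + (1.0 - lam) * np.sum((XV - S) ** 2)
            + eta1 * np.sum(V ** 2) + eta2 * np.sum(W ** 2))

def grad_V(X, T, S, V, W, lam, eta1):
    return 2.0 * (V @ X.T @ X + eta1 * V
                  - lam * W @ T.T @ X - (1.0 - lam) * S.T @ X)

def grad_W(X, T, V, W, lam, eta2):
    return 2.0 * (eta2 * W + lam * (W @ T.T @ T - V @ X.T @ T))

def descend(f, grad, M, mu, eps, max_steps):
    """Gradient steps on one matrix until the decrease drops below eps."""
    for _ in range(max_steps):
        value1 = f(M)
        M = M - mu * grad(M)
        value2 = f(M)
        if value1 - value2 < eps:
            break
    return M

def train_i2t(X, T, S, lam=0.1, eta1=0.5, eta2=0.5, mu=1e-3, eps=1e-6,
              max_outer=100, max_inner=500):
    """Alternate between updating V (W fixed) and W (V fixed)."""
    V = np.zeros((S.shape[1], X.shape[1]))
    W = np.zeros((S.shape[1], T.shape[1]))
    previous = objective(X, T, S, V, W, lam, eta1, eta2)
    for _ in range(max_outer):
        V = descend(lambda M: objective(X, T, S, M, W, lam, eta1, eta2),
                    lambda M: grad_V(X, T, S, M, W, lam, eta1),
                    V, mu, eps, max_inner)
        W = descend(lambda M: objective(X, T, S, V, M, lam, eta1, eta2),
                    lambda M: grad_W(X, T, V, M, lam, eta2),
                    W, mu, eps, max_inner)
        current = objective(X, T, S, V, W, lam, eta1, eta2)
        if previous - current < eps:   # assumed outer convergence test
            break
        previous = current
    return V, W

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 6))
T = rng.standard_normal((30, 4))
S = np.eye(3)[rng.integers(0, 3, size=30)]
V1, W1 = train_i2t(X, T, S)
initial = objective(X, T, S, np.zeros_like(V1), np.zeros_like(W1), 0.1, 0.5, 0.5)
final = objective(X, T, S, V1, W1, 0.1, 0.5, 0.5)
```

A fixed step size only decreases the objective when it is small relative to the largest eigenvalue of $X^T X$ (and $T^T T$), so the value of `mu` has to be chosen with the feature scale in mind.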

4. EXPERIMENTAL RESULTS 

To evaluate the proposed MDCR algorithm, we systematically compare it with other state-of-the-art
methods on three datasets, i.e., Wikipedia [Rasiwasia et al. 2010], Pascal Sentence [Rashtchian et
al. 2010] and a subset of INRIA-Websearch [Krapac et al. 2010].
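
ALGORITHM 1: Optimization for Modality-dependent Cross-media Retrieval

Input: The feature matrix of image data $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times p}$, the
    feature matrix of text data $T = [t_1, \ldots, t_n]^T \in \mathbb{R}^{n \times q}$, the semantic
    matrix corresponding to images and text $S = [s_1, \ldots, s_n]^T \in \mathbb{R}^{n \times c}$.
Initialize $V_1^{(\nu)}$ and $W_1^{(\omega)}$ with $\nu \leftarrow 0$ and $\omega \leftarrow 0$. Set
    the parameters $\lambda$, $\eta_1$, $\eta_2$, $\mu$ and $\epsilon$. $\mu$ is the step size in
    the alternating updating process and $\epsilon$ is the convergence condition.
repeat
    Alternative optimization process for I2T (Algorithm 2).
until convergence or the maximum iteration number is reached;
Output: $V_1^{(\nu)}$, $W_1^{(\omega)}$.


ALGORITHM 2: Alternative Optimization Process for I2T

repeat
    Set value1 $= f(V_1^{(\nu)}, W_1^{(\omega)})$;
    Update $V_1^{(\nu+1)} = V_1^{(\nu)} - \mu \nabla_{V_1} f(V_1^{(\nu)}, W_1^{(\omega)})$;
    Set value2 $= f(V_1^{(\nu+1)}, W_1^{(\omega)})$, $\nu \leftarrow \nu + 1$;
until value1 − value2 < $\epsilon$;
repeat
    Set value1 $= f(V_1^{(\nu)}, W_1^{(\omega)})$;
    Update $W_1^{(\omega+1)} = W_1^{(\omega)} - \mu \nabla_{W_1} f(V_1^{(\nu)}, W_1^{(\omega)})$;
    Set value2 $= f(V_1^{(\nu)}, W_1^{(\omega+1)})$, $\omega \leftarrow \omega + 1$;
until value1 − value2 < $\epsilon$;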


4.1. Datasets 

Wikipedia²: This dataset contains 2,866 image-text pairs in total from 10 categories. The whole
dataset is randomly split into a training set and a test set with 2,173 and 693 pairs,
respectively. We utilize the publicly available features provided by [Rasiwasia et al. 2010], i.e.,
128 dimensional SIFT BoVW for images and 10 dimensional LDA for text, to compare directly with
existing results. Besides, we also present the cross-media retrieval results based on the 4,096
dimensional CNN visual features³ and the 100 dimensional Latent Dirichlet Allocation (LDA) [Blei et
al. 2003] textual features (we first obtain the textual feature vector based on 500 tokens, and the
LDA model is then used to compute the probability of each document under 100 topics).

Pascal Sentence⁴: This dataset contains 1,000 pairs of image and text descriptions from 20
categories (50 for each category). We randomly select 30 pairs from each category as the training
set, and the rest are taken as the testing set. We utilize the 4,096 dimensional CNN visual feature
for image representation. For textual features, we first extract the feature vector based on the
300 most frequent tokens (with stop words removed) and then utilize LDA to compute the probability
of each document under 100 topics. The 100 dimensional probability vector is used for textual
representation.


²http://www.svcl.ucsd.edu/projects/crossmodal/

³The CNN model is pre-trained on ImageNet. We utilize the outputs from the second fully-connected
layer as the CNN visual feature in this paper. For more details, please refer to [Krizhevsky et al.
2012].

⁴http://vision.cs.uiuc.edu/pascal-sentences/

Table I. mAP scores for image and text queries on the Wikipedia dataset based on the publicly
available features.

Query    PLS    BLM    CCA    SM     SCM    GMMFA  GMLDA  T-V CCA  MDCR
Image    0.207  0.237  0.182  0.225  0.277  0.264  0.272  0.228    0.287
Text     0.192  0.144  0.209  0.223  0.226  0.231  0.232  0.205    0.225
Average  0.199  0.191  0.196  0.224  0.252  0.248  0.253  0.217    0.256


INRIA-Websearch: This dataset contains 71,478 pairs of image and text annotations from 353
categories. We remove the pairs that are marked as irrelevant and select the pairs that belong to
any one of the 100 largest categories, which gives a subset of 14,698 pairs for evaluation. We
randomly select 70% of the pairs from each category as the training set (10,332 pairs), and the
rest are treated as the testing set (4,366 pairs). We utilize the 4,096 dimensional CNN visual
feature for image representation. For textual features, we first obtain the feature vector based on
the 25,000 most frequent tokens (with stop words removed) and then employ LDA to compute the
probability of each document under 1,000 topics.

For semantic representation, the ground-truth labels of each dataset are employed 
to construct semantic vectors (10 dimensions for Wikipedia dataset, 20 dimensions for 
Pascal Sentence dataset, and 100 dimensions for INRIA-Websearch dataset) for pairs 
of image and text. 

4.2. Experimental Settings 

In the experiments, Euclidean distance is used to measure the similarity between features in the
embedding latent subspace. Retrieval performance is evaluated by mean average precision (mAP),
which is one of the standard information retrieval metrics. Specifically, given a set of queries,
the average precision (AP) of each query is defined as:

$$ AP = \frac{\sum_{k=1}^{R} P(k)\, rel(k)}{\sum_{k=1}^{R} rel(k)}, $$

where $R$ is the size of the test dataset, $rel(k) = 1$ if the item at rank $k$ is relevant and
$rel(k) = 0$ otherwise, and $P(k)$ denotes the precision of the result ranked at $k$. We obtain the
mAP score by averaging AP over all queries.
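
The evaluation protocol can be sketched as follows. The function names are hypothetical, and the sketch assumes that a retrieved item counts as relevant when it shares the class label of the query and that queries without any relevant item contribute an AP of 0; the source does not spell out either convention.

```python
import numpy as np

def average_precision(relevance):
    """AP of one query; `relevance` holds rel(k) in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0                     # assumed convention: no relevant items
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_k = np.cumsum(relevance) / ranks      # P(k)
    return float(np.sum(precision_at_k * relevance) / relevance.sum())

def mean_average_precision(queries, gallery, query_labels, gallery_labels):
    """Rank gallery items by Euclidean distance to each projected query."""
    gallery = np.asarray(gallery, dtype=float)
    gallery_labels = np.asarray(gallery_labels)
    aps = []
    for z, label in zip(np.asarray(queries, dtype=float), query_labels):
        distances = np.linalg.norm(gallery - z, axis=1)
        order = np.argsort(distances, kind="stable")
        aps.append(average_precision(gallery_labels[order] == label))
    return float(np.mean(aps))

# Worked example: relevant items at ranks 1 and 3 give AP = (1 + 2/3) / 2.
ap_example = average_precision([1, 0, 1, 0])

# Toy retrieval: both items of class 0 are closer to the query than class 1.
map_example = mean_average_precision(
    queries=[[0.0, 0.0]], gallery=[[0.0, 0.1], [5.0, 5.0], [0.2, 0.0]],
    query_labels=[0], gallery_labels=[0, 1, 0])
```

For I2T, `queries` would hold the projected test images and `gallery` the projected test text; for T2I the roles are swapped and the T2I projections are used.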

4.3. Results 

In the experiments, we mainly compare the proposed MDCR with six algorithms, including CCA, Semantic
Matching (SM) [Rasiwasia et al. 2010], Semantic Correlation Matching (SCM) [Rasiwasia et al. 2010],
Three-View CCA (T-V CCA) [Gong et al. 2013], Generalized Multiview Marginal Fisher Analysis (GMMFA)
[Sharma et al. 2012] and Generalized Multiview Linear Discriminant Analysis (GMLDA) [Sharma et al.
2012].

For the Wikipedia dataset, we first compare the proposed MDCR with other methods based on the
publicly available features [Rasiwasia et al. 2010], i.e., 128-SIFT BoVW for images and 10-LDA for
text. We fix $\mu = 0.02$ and $\epsilon = 10^{-\ldots}$, and experimentally set $\lambda = 0.1$,
$\eta_1 = 0.5$ and $\eta_2 = 0.5$ for the optimization of I2T; the parameters for T2I are set as
$\lambda = 0.5$, $\eta_1 = 0.5$ and $\eta_2 = 0.5$. The mAP scores for each method are shown in
Table I. It can be seen that MDCR achieves the highest image-query and average mAP scores among the
compared common space learning methods. To further validate the necessity of being task-specific
for cross-media retrieval, we evaluate the proposed method in terms of training a unified $V$ and
$W$ by incorporating both linear regression terms in Eq.(2) and Eq.(3) into a single optimization
objective. As shown in Table II, the learned subspaces for I2T and T2I cannot be used
interchangeably, and the unified scheme can only achieve compromised performance for each retrieval
task, which falls short of the proposed modality-dependent scheme.


Table II. Comparison between MDCR and its unified scheme for cross-media retrieval on the Wikipedia
dataset.

Wikipedia  MDCR-Eq.(2)  MDCR-Eq.(3)  Unified Scheme
I2T        0.287        0.165        0.236
T2I        0.146        0.225        0.216


As a very popular dataset, Wikipedia has been employed by many other works for cross-media
retrieval evaluation. With a different train/test division, [Wu et al. 2014] achieved an average
mAP score of 0.226 (Image Query: 0.227, Text Query: 0.224) through a sparse hash model and [Wang et
al. 2014] achieved an average mAP score of 0.183 (Image Query: 0.187, Text Query: 0.179) through a
deep auto-encoder model. Besides, some other works utilized their own extracted features (both for
images and text) for cross-media retrieval evaluation. To further validate the effectiveness of the
proposed method, we also compare MDCR with other methods based on more powerful features, i.e.,
4,096-CNN for images and 100-LDA for text. We fix $\mu = 0.02$ and $\epsilon = 10^{-\ldots}$, and
experimentally set $\lambda = 0.1$, $\eta_1 = 0.5$ and $\eta_2 = 0.5$ for the optimization of I2T
and T2I. The comparison results are shown in Table IV. It can be seen that the compared methods
reach new state-of-the-art performances with the new feature representations, and that the proposed
MDCR still outperforms them. In addition, we also compare our method with the recent work [Kang et
al. 2014], which utilizes 4,096-CNN for images and 200-LDA for text, in Table III. We can see that
the proposed MDCR reaches a new state-of-the-art performance on the Wikipedia dataset. Please refer
to Fig. 3 for the comparisons of Precision-Recall curves and Fig. 4 for the mAP score of each
category. Figure 5 gives some successful and failure cases of our method. For the image query (the
2nd row), although the query image is categorized into Art, it is predominantly characterized by the
human figure, i.e., a strong man, which has been captured by our method and thus leads to the
failure results shown. For the text query (the 4th row), there exist many Warfare descriptions in
the document, such as war, army and troops, which can hardly be related to the label of the query
text, i.e., Art.

For the Pascal Sentence dataset and the INRIA-Websearch dataset, we experimentally set
$\lambda = 0.5$, $\eta_1 = 0.5$, $\eta_2 = 0.5$, $\mu = 0.02$ and $\epsilon = 10^{-\ldots}$ during
the alternative optimization process for I2T and T2I. The comparison results can be found in
Table IV. It can be seen that our method achieves the best average mAP on both datasets, even on
the more challenging INRIA-Websearch dataset (with 14,698 pairs of multi-media data and 100
categories). Please refer to Fig. 3 for the comparisons of Precision-Recall curves for these two
datasets and Fig. 4 for the mAP score of each category on the Pascal Sentence dataset.


Table III. Cross-media retrieval comparison with results of four methods reported by [Kang et al.
2014] on the Wikipedia dataset.

Query    GMLDA  GMMFA  MsAlg  LRBS   TSCR
Image    0.368  0.387  0.373  0.445  0.435
Text     0.297  0.311  0.327  0.377  0.394
Average  0.332  0.349  0.350  0.411  0.415

Table IV. Comparisons of cross-media retrieval performance.

Dataset          Query    CCA    SM     SCM    T-V CCA  GMLDA  GMMFA  MDCR
Wikipedia        Image    0.226  0.403  0.351  0.310    0.372  0.371  0.435
                 Text     0.246  0.357  0.324  0.316    0.322  0.322  0.394
                 Average  0.236  0.380  0.337  0.313    0.347  0.346  0.415
Pascal Sentence  Image    0.261  0.426  0.369  0.337    0.456  0.455  0.455
                 Text     0.356  0.467  0.375  0.439    0.448  0.447  0.471
                 Average  0.309  0.446  0.372  0.388    0.452  0.451  0.463
INRIA-Websearch  Image    0.274  0.439  0.403  0.329    0.505  0.492  0.520
                 Text     0.392  0.517  0.372  0.500    0.522  0.510  0.551
                 Average  0.333  0.478  0.387  0.415    0.514  0.501  0.535


Fig. 3. Precision-Recall curves of the proposed MDCR and compared methods (CCA, SM, SCM, T-V CCA,
GMLDA and MDCR) for image queries and text queries on (a) the Wikipedia dataset, (b) the Pascal
Sentence dataset and (c) the INRIA-Websearch dataset.

Fig. 4. mAP performance for each class on (a) the Wikipedia dataset and (b) the Pascal Sentence
dataset (recoverable panel titles: Text Query, Average Performance).

5. CONCLUSIONS

Cross-media retrieval has long been a challenge. In this paper, we focus on designing an effective
cross-media retrieval model for images and text, i.e., using an image to search for text (I2T) and
using text to search for images (T2I). Different from traditional common space learning algorithms,
we propose a modality-dependent scheme which recommends different treatments for I2T and T2I by
learning two couples of projections for different cross-media retrieval tasks. Specifically, by
jointly optimizing a correlation term (between images and text) and a linear regression term (from
one modal space, i.e., image or text, to the semantic space), two couples of mappings are obtained
for different retrieval tasks. Extensive experiments on the Wikipedia dataset, the Pascal Sentence
dataset and the INRIA-Websearch dataset show the superiority of the proposed method compared with
state-of-the-art methods.